Fast Speaker Diarization Using a Specialization Framework for Gaussian Mixture Model Training
نویسندگان
چکیده
Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this thesis we present a Python-based GMM training specialization framework that abstracts lowlevel GPU code and instead automatically selects the best parallel implementation of the training algorithm based on the diarization problem size and the processor features and maps the computation onto the parallel platform. Our specialization framework can automatically map the GMM training algorithm onto any CUDA-programmable NVIDIA GPU as well as multi-core CPUs. We then present a full speaker diarization system captured in about 50 lines of Python that uses our specialization framework and achieves 37166× faster than real-time performance without significant loss in accuracy. We also investigate a trade-off between diarization accuracy and performance and show that for a 3% gain in Diarization Error Rate (DER), the application performance decreases by 5×. Using our framework allows the scientist to focus on developing the application algorithm while achieving significant additional performance improvements by automatically utilizing parallel hardware.
منابع مشابه
Chapter 6 Introduction to Speech and Multimedia Applications in the Par Lab
Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when" in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing...
متن کاملSpeaker Diarization Using Gaussian Mixture Turns and Segment Matching
Speaker diarization aims to detect “who spoke when” in large audio segments. It is an important task in processing of broadcast news audio, making easier the audio segments selection and indexing task. In this paper an unsupervised speaker diarization scheme is proposed using a Gaussian Mixture Model as a Universal Background Model, Bayesian Information Criterion and fingerprint detection. A de...
متن کاملThe Approach of Speaker Diarization by Gaussian Mixture Model (GMM)
Speaker identification is an important activity in the process of speaker diarization. We need to model the speaker by Gaussian mixture model (GMM) for speaker identification purpose. Large GMM is called as a Universal Background Model (UBM) which is adapted into each speaker model for speaker identification purpose. This paper focuses on speech clustering for speaker diarization. The speaker d...
متن کاملCross Likelihood Ratio Based Speaker Clustering Using Eigenvoice Models
This paper proposes the use of eigenvoice modeling techniques with the Cross Likelihood Ratio (CLR) as a criterion for speaker clustering within a speaker diarization system. The CLR has previously been shown to be a robust decision criterion for speaker clustering using Gaussian Mixture Models. Recently, eigenvoice modeling techniques have become increasingly popular, due to its ability to ade...
متن کاملOn the use of GSV-SVM for Speaker Diarization and Tracking
In this paper, we present the use of Gaussian Supervectors with Support Vector Machines classifiers (GSV-SVM) in an acoustic speaker diarization and a speaker tracking system, compared with a standard Gaussian Mixture Model system based on adapted Universal Background Models (GMM-UBM). GSVSVM systems (which share the adaptation step with the GMMUBM systems) are observed to have comparable perfo...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2011